Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems
Abstract
With the advent of word representations, word similarity tasks are becoming increasingly popular as a way to evaluate the quality of those representations. In this paper, we present manually annotated monolingual word similarity datasets for six Indian languages: Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. These are the most widely spoken Indian languages worldwide after Hindi and Bengali. To construct the datasets, we translate existing English word similarity datasets and re-annotate them. We also present baseline scores for word representation models built with state-of-the-art techniques for Urdu, Telugu and Marathi, obtained by evaluating them on the newly created word similarity datasets.
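As an illustration of how a word representation model is typically scored against a word similarity dataset, the sketch below computes the Spearman rank correlation between human similarity ratings and the cosine similarity of the corresponding word vectors. The abstract does not state which correlation measure or which models the authors used, so the choice of Spearman correlation, the function names, and the toy vectors and word pairs are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, pairs):
    """Score embeddings on (word1, word2, human_score) triples.

    Pairs containing an out-of-vocabulary word are skipped. Returns the
    Spearman correlation and the number of pairs actually used.
    """
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            gold.append(score)
            predicted.append(cosine(embeddings[w1], embeddings[w2]))
    return spearman(gold, predicted), len(gold)


# Hypothetical toy data: placeholder tokens, not entries from the datasets.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.5, 0.5],
}
toy_pairs = [
    ("a", "b", 9.0),
    ("a", "d", 5.0),
    ("a", "c", 1.0),
    ("a", "missing", 3.0),  # skipped: "missing" has no vector
]

rho, used = evaluate(toy_embeddings, toy_pairs)
print(f"Spearman rho = {rho:.3f} over {used} pairs")
```

Rank correlation is a common choice for this kind of benchmark because human ratings and cosine similarities live on different scales, and only the relative ordering of the pairs is compared. Reporting the number of pairs used alongside the score matters for morphologically rich languages, where out-of-vocabulary words can remove a sizeable share of the dataset.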
Similar resources
The IIT Bombay SMT System for ICON 2014 Tools Contest
In this paper, we describe our submission to the ICON 2014 Tools Contest for Machine Translation. The source languages are English, Marathi, Tamil, Telugu, Bengali and the target language is Hindi. We submitted 15 systems; 5 each for the tourism, health and general domains. Our submission is a Phrase-based Statistical Machine Translation system with preprocessing and post-processing elements. A...
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets
Despite being one of the most popular tasks in lexical semantics, word similarity has often been limited to the English language. Other languages, even those that are widely spoken such as Spanish, do not have a reliable word similarity evaluation framework. We put forward robust methodologies for the extension of existing English datasets to other languages, both at monolingual and cross-lingu...
SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity
This paper introduces a new task on Multilingual and Cross-lingual Semantic Word Similarity which measures the semantic similarity of word pairs within and across five languages: English, Farsi, German, Italian and Spanish. High quality datasets were manually curated for the five languages with high inter-annotator agreements (consistently in the 0.9 ballpark). These were used for semi-automati...
Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure
We suggest a new method for creating and using gold-standard datasets for word similarity evaluation. Our goal is to improve the reliability of the evaluation, and we do this by redesigning the annotation task to achieve higher inter-rater agreement, and by defining a performance measure which takes the reliability of each annotation decision in the dataset into account.
Cost Effective Dependency Parsing for Indian Languages
Indian languages are MoR-FWO (morphologically rich with free word order) and hence differ from English in structure and morphology. Indian languages possess many distinctive characteristics. While working with these languages, we have to keep these characteristics in mind and plan strategies accordingly. We worked on improving Dependency Parsing for Indian Languages, more specifically for Hindi, an Indo-Aryan Language. ...